Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora

Authors

Irina P. Temnikova

William A. Baumgartner

Negacy D. Hailu

Ivelina Nikolova

Tony McEnery

Adam Kilgarriff

Galia Angelova

K. Bretonnel Cohen

Abstract

Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
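
Lexical closure, the first property listed in the abstract, is commonly examined by tracking how many distinct word types have been seen as more tokens of a corpus are read. The following is a minimal sketch of that idea, not SubCAT's actual code: the function names `type_growth_curve` and `new_types_in_last_window`, the lowercasing step, and the sampling interval `step` are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def type_growth_curve(tokens: Iterable[str], step: int = 1000) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) pairs sampled every `step` tokens.

    Hypothetical helper: tokens are lowercased before counting, which is an
    assumption of this sketch rather than a documented SubCAT behavior.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve: List[Tuple[int, int]] = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if count % step == 0:
            curve.append((count, len(seen)))
    # Record a final point when the corpus length is not a multiple of step.
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


def new_types_in_last_window(curve: List[Tuple[int, int]]) -> int:
    """Number of types first seen in the final sampling window."""
    if not curve:
        return 0
    if len(curve) < 2:
        return curve[-1][1]
    return curve[-1][1] - curve[-2][1]


if __name__ == "__main__":
    sample = "the patient was given the drug and the patient was discharged".split()
    curve = type_growth_curve(sample, step=4)
    for tokens_seen, types_seen in curve:
        print(tokens_seen, types_seen)
    print("new types in last window:", new_types_in_last_window(curve))
```

Under these assumptions, a curve that flattens as tokens accumulate suggests a tendency towards closure, while a curve that keeps climbing suggests an open vocabulary.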

Similar resources

SuperCAT: The (New and Improved) Corpus Analysis Toolkit

This paper reports SuperCAT, a corpus analysis toolkit. It is a radical extension of SubCAT, the Sublanguage Corpus Analysis Toolkit, from sublanguage analysis to corpus analysis in general. The idea behind SuperCAT is that representative corpora have no tendency towards closure; that is, they tend towards infinity. In contrast, non-representative corpora have a tendency towards closure: roughly,...

Recognizing Sublanguages in Scientific Journal Articles through Closure Properties

It has long been realized that sublanguages are relevant to natural language processing and text mining. However, practical methods for recognizing or characterizing them have been lacking. This paper describes a publicly available set of tools for sublanguage recognition. Closure properties are used to assess the goodness of fit of two biomedical corpora to the sublanguage model. Scientific jo...

How Should A Large Corpus Be Built? - A Comparative Study Of Closure In Annotated Newspaper Corpora From Two Chinese Sources, Towards Building A Larger Representative Corpus Merged From Representative Sublanguage Collections

This study measures comparative lexical and syntactic closure rates in annotated Chinese newspaper corpora from the Academia Sinica Balanced Corpus and the University of Pennsylvania's Chinese Treebank. It then draws inferences as to how large such corpora need to be to serve as representative models of subject-matter-constrained language domains within the same genre. Future large corpora should be bui...

Measuring Web-Corpus Randomness: A Progress Report

The Web allows fast and inexpensive construction of general purpose corpora, i.e., corpora that are not meant to represent a specific sublanguage, but a language as a whole, and thus should be unbiased with respect to domains and genres. In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness (with respect to a number of non-random partitions) of a...

A New Direction

There have been a number of theoretical studies devoted to the notion of sublanguage. Furthermore, there are some successful natural language processing systems which have explicitly or implicitly utilized sublanguage restrictions. However, two big problems are still unsolved to utilize the sublanguage notion: automatic definition and dynamic identification of a text to sublanguage and automatic l...

Journal title:

LREC ... International Conference on Language Resources & Evaluation : [proceedings]. International Conference on Language Resources and Evaluation

Volume: 2014

Publication date: 2014
